William Eberle,
Lawrence Holder,
Student
team: NO
In order to analyze the Paraiso cell-phone traffic,
specifically the Catalano/Vidro social structure, we used the Graph-Based
Anomaly Detection (GBAD) tool to focus the visualization on interesting
structural anomalies. Initially created
in 2006 as a joint venture between the
Two Page Summary: YES
Analyzing Catalano/Vidro Social
Structure Using GBAD
ANSWERS:
Phone-1:
What is the Catalano/Vidro social network, as reflected in the cell phone call
data, at the end of the time period
Phone-2
Characterize the changes in the Catalano/Vidro social structure over the ten
day period.
Detailed Answer:
Our Graph-Based Anomaly Detection
(GBAD) system takes a graph-representation of data and applies three algorithms
that analyze the graph for structural anomalies. Each of these algorithms is applied after the
normative graph structure has been discovered.
It is our hypothesis that such a system can discover knowledge in a graph
representation of the Cell Phone Calls (social network) data that will (1) show
the normal social structure of the Catalano’s and Vidro’s, (2) show when the
social structure has been broken, and (3) show anomalies in the social behavior
of this group, indicating possible breaches to their “inner circle”.
In order to apply GBAD to the
problem data set, we first had to convert the cell-phone calls to a graph
representation. Initially, we created a
graph from all of the data that was provided, as shown in Figure 1.
Figure 1 Graph
representation of cell-phone traffic.
In this graph representation, each call consists
of a vertex for each PERSON found in the original data set. Each time a PERSON is a SENDER or RECEIVER of
a call, a link is made between each PERSON vertex, with a corresponding
intermediary CALL vertex. Each CALL
vertex, unique to a placed call, contains three properties: the DAY the call was made, the DURATION of
the call, and the LOCATION of the call origination cell tower.
However, since the goal of this objective was to
characterize the changes in the social structure of the Catalano/Vidro network,
we then decided to forego the DAY, DURATION and LOCATION links (and their
associated vertices), and focus on just the social interactions between the
pertinent parties. Thus, we decided to
just create, in essence, a social network that indicated for a particular day,
who called who. To start with, based
upon all of the information that was provided, we made the following
assumptions about this particular data set:
Starting with these simple assumptions, and a
graph that consisted of vertices for each unique ID with links between the
vertices if there was a conversation on that day, we were able to create a
simple visualization of Ferdinando’s “inner circle” social network structure
(or Catalano/Vidro social structure) over the 10 days that data was generated
(Figure 2).
Figure 2 Graph representation
of Ferdinando's inner circle.
Figure 2 (and subsequent figures)
was rendered using AT&T’s graph visualization program (http://www.graphviz.org/). In particular, the GraphViz tool we used is
called dotty. Dotty takes a graph file, specified in the
GraphViz format, and renders the graph in a visual format (like what is shown
in Figure 2). The expected input format
to dotty consists of vertex numbers, and their corresponding labels and colors,
and edges that specify direction, label, arrowhead and color. One of the tools available with GBAD/SUBDUE
called graph2dot can convert the
GBAD/SUBDUE graph input file into the GraphViz dotty format. So, we took the graph input file, ran
graph2dot on the input file, which produced a dotty version of the graph, which
we then input into the Windows version of dotty. This “visualization” (shown in Figure 2)
shows the graph structure of interactions between people in 200’s (i.e.,
Ferdinando Catalano’s) inner circle (i.e., 200, 1, 2, 3, 5, 97 and 137). Each of the disconnected subgraphs represents
the calling-history between the targeted group members for a single day. Thus, you will notice that there are only 9
substructures in the graph. This is due
to the fact that on Day 8, nobody in 200’s inner circle talked to each
other. In other words, there were no
calls between 1, 2, 3, 5, 97, 137 or 200 on that day.
Once we had generated the graph, we
ran the graph through the GBAD system, applying all three algorithms to see if
any (or all) of them would discover any anomalous patterns in the social
structure. GBAD is a Unix command-line
tool with various parameters that control the search space. Through command-line parameters, a user can
specify various performance tuning criteria such as pruning, the size of the
search beam, and the number of substructures to be kept on the candidate
list. Since the cell-phone traffic for
this mini-challenge only consists of 400 call records, we only customized two
parameters (leaving the other parametric settings to their default
values): the number of substructures
kept on the candidate list was set to 200, and the number of GBAD iterations
was set to 2. Both of these values are
somewhat arbitrary, and were chosen due to the size of the graph and the size
of the normative pattern. The first
parameter indicates that we wanted to keep only the top 200 best substructures,
as the diversity in substructure patterns is somewhat sparse; and only two
iterations were needed because each phone call is only composed of a few
attributes, for which most of the time these attributes were included in the
normative pattern.
Figure 3 shows the original graph,
as well as the normative pattern and anomalous patterns.
Figure 3 Ferdinando's
inner circle with associated normative pattern and anomalies.
In Figure 3, the structure in green indicates the normative pattern that was
discovered in this graph. The
substructure in red indicates an anomaly that was
discovered by the GBAD-P algorithm, which analyzes a graph for anomalous
extensions to the normative pattern. In
this case, the fact that 5 called 97 was anomalous, when compared to other
instances of what was the normal social structure. The substructure in orange
indicates an anomaly that was discovered by the GBAD-MPS algorithm, which
analyzes a graph for missing structure.
In this case, the fact that 200 did not talk to 3 on that day is
considered anomalous. Again, we were
able to visualize these results by manually editing the dotty input file and
changing the “color” values based upon the results that were returned by the
GBAD system. GBAD returns the anomalies
in an ASCII text format, and does not render a graphical visualization. But, as is shown, a simple visualization of
these results using the GraphViz dotty utility shows the usefulness of this
approach.
Looking further at this
visualization of Ferdinando Catalano’s calling-history, we are able to make
several interesting observations about his social network:
Due to the above simple visualization
of this data, and what we know about the situation on Isla del Sueńo, we were
able to make these observations. We also
played with several other variants of the graph, including the “directedness”
of the graph. While we chose an
undirected graph for all of the results shown above (because we considered a
conversation between two people to be a two-way communication), we also looked
at a directed version of the graph, where the edge between two vertices was
directed going from the person who called to the person who was being
called. When we did that, we noticed
that 97 and 137 are never called by 1, 2, 3 and 5 – and they only call 5 and
200.
The advantage of this approach in
terms of this contest is that we are able to show how a simpler visualization
technique, combined with graph mining tools, can help an analyst focus their
search for relevant information in what can be a fairly complex network of
communications. Traditional data mining
approaches involve the probabilities and distributions of data values, while a
graph-based approach such as this can discover differences in data when
structure and relationships are represented as nodes and links.
While the cell-phone calls were made over a ten-day period, as shown
above we chose to represent each day as an individual sub-graph. In the future, we will investigate the
representation of dynamic graphs which would then allow us to indicate a change
in the structure over time.